From: avie@next.com (Avadis Tevanian)
Newsgroups: comp.sys.next.misc
Subject: Re: Why does NS require so much Memory?
Date: 6 Jun 1994 04:38:53 GMT

In article <1994Jun5.221433.24748@sifon.cc.mcgill.ca> samurai@cs.mcgill.ca
(Darcy BROCKBANK) writes:
> Oh well... can someone more informed than me
> *please* take up this discussion, because I don't have
> enough knowledge on this to come to the correct conclusion.

Here are the facts on how swapfiles work:

For every page in the swapfile, the kernel maintains status telling whether
that page is in use or not.  When a swapfile is enabled (mach_swapon), it is
truncated to lowat and each page is flagged as free.  When the page out daemon
requests a page to be swapped out, the pager locates the first free page in the
swapfile (actually, there is an algorithm to determine which swapfile is used,
if more than one is enabled, but I will omit this from the discussion).  The
first free page is defined as the lowest numbered page.  As more and more
memory is consumed by processes, higher and higher numbered pages are used.
When all pages in the swapfile are in use, an additional page out causes the
swapfile to be extended in size.  This occurs until hiwat is reached.  If hiwat
is reached, or if the file system is out of space, the page will be left in
memory (unless there is another swapfile enabled that can be used).  If the
system stays in this state, it will eventually be full of dirty pages which can
not be paged out.  When this happens, the system comes to a grinding halt as it
is forced to use fewer and fewer pages of memory (memory is filled with dirty
pages that can not be paged out).
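
The allocation policy described above can be sketched as a toy model.  This is
only an illustration, not the Mach pager: the class and method names are made
up, lowat and hiwat are expressed in pages for simplicity, and the
out-of-disk-space case and multiple swapfiles are not modeled.

```python
class Swapfile:
    """Toy model of one swapfile's page map (hypothetical, not the Mach pager)."""

    def __init__(self, lowat, hiwat):
        # Assumption: lowat and hiwat are given in pages here.
        self.hiwat = hiwat
        # Enabling the swapfile truncates it to lowat; every page starts free.
        self.in_use = [False] * lowat

    def page_out(self):
        """Return the swapfile page used, or None if the page stays in memory."""
        # The first free page is the lowest-numbered free page.
        for n, used in enumerate(self.in_use):
            if not used:
                self.in_use[n] = True
                return n
        # Every page is in use: extend the file by one page, up to hiwat.
        if len(self.in_use) < self.hiwat:
            self.in_use.append(True)
            return len(self.in_use) - 1
        # hiwat reached: the dirty page cannot be written to this swapfile.
        return None
```

With lowat=2 and hiwat=4, successive page outs in this model use pages 0, 1, 2
and 3, and the fifth one is refused, which is the state the paragraph above
warns about.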

Now, it gets interesting when we consider what happens when memory is freed.
In particular, when a process exits or calls vm_deallocate, the VM system
attempts to free any memory that was associated with the appropriate regions of
virtual memory.  When memory is shared, it simply makes a note that there is
one fewer reference to the shared memory (or copy-on-written memory) and no
further action is taken.  If this is the last reference to the memory, any
corresponding physical pages are freed from main memory and any corresponding
pages in the swapfile are tagged as free.  A subsequent allocation of a page in
the swapfile will most definitely reuse this page!
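
A small sketch of the freeing rule just described, under simplifying
assumptions: SharedObject, deallocate and lowest_free are hypothetical names,
and a plain reference count stands in for the real VM object bookkeeping.

```python
class SharedObject:
    """Toy stand-in for memory that may be mapped by several tasks."""

    def __init__(self, swap_pages, refs=1):
        self.refs = refs                    # number of mappings still alive
        self.swap_pages = list(swap_pages)  # swapfile pages backing this memory


def deallocate(obj, free_map):
    """Drop one reference; release swapfile pages only on the last one."""
    obj.refs -= 1
    if obj.refs > 0:
        return False          # still shared: just note one fewer reference
    for n in obj.swap_pages:
        free_map[n] = True    # tag the swapfile page as free
    obj.swap_pages.clear()
    return True


def lowest_free(free_map):
    """Page the next page out would use: the lowest-numbered free page."""
    for n, is_free in enumerate(free_map):
        if is_free:
            return n
    return None
```

For example, with free_map = [False, False, False, True] and an object backed
by page 1 with two references, the first deallocate leaves page 1 in use; the
second frees it, and lowest_free then returns 1 rather than 3, so the freed
page is the one that gets reused.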

When a page is freed, if it is the highest page in the swapfile, the swapfile
will be truncated down to the highest page still in use (but never below lowat).
In practice, this happens rarely.  The basic problem is that if you have a long
running process using a very high numbered page (e.g., if the Windowserver
allocates a high numbered page) the swapfile will not get truncated until that
process exits --- which could be a very long time.  When this happens due to a
core process (e.g., the nmserver), which cannot be restarted unless the system
is rebooted, your swapfile will remain large.  Still, there can be lots of free
pages in the swapfile, and rest assured they will be reused!
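
To make the truncation rule concrete, here is a sketch that computes the size a
swapfile could shrink to.  The function name is hypothetical and sizes are in
pages, which is an assumption made for the example.

```python
def truncated_size(in_use, lowat):
    """Swapfile length in pages after trimming trailing free pages.

    The file can only be cut back to just past the highest page still in
    use, and never below lowat.
    """
    highest = max((n for n, used in enumerate(in_use) if used), default=-1)
    return max(highest + 1, lowat)


# One long-lived page at the top pins the whole file, even though
# four of the six pages are free (and will be reused).
pinned = [True, False, False, False, False, True]
print(truncated_size(pinned, lowat=2))

# Once that page is freed, the file can shrink, but only down to lowat.
released = [True, False, False, False, False, False]
print(truncated_size(released, lowat=2))
```

The first case stays at 6 pages because page 5 is still in use; the second
drops to 2 pages, the lowat floor, even though only page 0 is in use.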

So why don't we compact the swapfile to handle these pages that get allocated
at high page numbers?  Good question.  We've considered doing it many times.
However, it has always been considered a quite risky change (how many of YOU
have debugged a virtual memory system before?) and would need to be done very
carefully to ensure correctness and adequate performance.  As an example, it
would not be acceptable to just start a compaction and cause the system to lock
up as the kernel does several megabytes of I/O for the compaction.  The
relative merits of making this improvement have never outweighed the costs in
risk and the opportunity costs of not working on other parts of the system.
I'm not saying we'll never do it, I'm just saying we haven't done it yet for
some carefully considered reasons.
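
For readers wondering what compaction would even involve, here is a purely
hypothetical sketch; it is not a description of anything in the kernel.  It
only plans which in-use pages at the top of the file would move into free
slots at the bottom, and it caps the number of moves per call so that the I/O
could be spread out instead of done in one long burst.  The hard parts
(updating every reference to a moved page, and doing so safely while paging
continues) are exactly what is left out.

```python
def compaction_moves(in_use, batch):
    """Plan at most `batch` moves as (source_page, destination_page) pairs.

    Pairs the highest in-use pages with the lowest free pages.  The page
    map itself is not modified; this only produces a plan.
    """
    moves = []
    lo, hi = 0, len(in_use) - 1
    while len(moves) < batch:
        while lo < hi and in_use[lo]:
            lo += 1               # skip pages already in use at the bottom
        while hi > lo and not in_use[hi]:
            hi -= 1               # skip pages already free at the top
        if lo >= hi:
            break                 # nothing left to move
        moves.append((hi, lo))
        lo += 1
        hi -= 1
    return moves
```

With in_use = [True, False, False, True, False, True], the full plan is to move
page 5 to page 1 and page 3 to page 2, after which the file could be truncated
to three pages; with batch=1 only the first of those moves is planned.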

Having said all of this, why do so many people seem to have problems with their
swapfiles?  Here are some possible explanations:

1)  Not everyone realizes just how much memory their apps use.  As has been
mentioned before, the Windowserver keeps backing store for all the windows (on
or off screen).  On 16-bit color systems this can be quite large; on 24-bit
systems it's downright huge!  Simple images on the screen can translate into
megabytes of storage.  Mathematica sessions are notorious for consuming tens or
even hundreds of megabytes of VM.

2)  Programs occasionally have memory leaks.  We work hard to be sure that the
software we release does not have leaks.  There's a reason we developed
MallocDebug!  I think we do pretty well, but I'm sure there are some bugs.  For
example, the Windowserver, with its printer heritage, has long had problems
with correctly managing its memory.  On the printers they just "reset" the
memory heap for each new job --- we can't do that.  If/when the Windowserver
leaks we get a double whammy since not only do we leak a small amount of
memory, but the Windowserver is a long running process and tends to hog those
high numbered pages.  I think NEXTSTEP ISVs generally do a good job too, but
it only takes one or two apps to leak memory and cause problems.

3)  As many of you know, Mach has a quite advanced virtual memory scheme, which
NEXTSTEP makes excellent use of.  Features like copy-on-write and pageable
read/write sharing can cause complex relationships between memory and how it is
mapped into one or more processes.  There is one known optimization that the
kernel does (specifically the coalescing of adjacent memory regions when
backing store has not yet been allocated --- for those of you Mach VM literate)
which sometimes causes the freeing of some memory to be delayed until a process
has exited.  The situations when this happens are fairly rare, and in the worst
case the memory is freed when the process exits, but it wouldn't surprise me if
this is the cause of isolated problems.
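
To put rough numbers on point (1), here is the backing store arithmetic for a
single full-screen window.  The figures are assumptions chosen for
illustration: a 1120x832 display, 2 bytes per pixel for 16-bit color, and
24-bit color stored in 4 bytes per pixel.  Actual Windowserver storage may
differ.

```python
def backing_store_bytes(width, height, bytes_per_pixel):
    """Bytes needed to hold one window's pixels (no overhead included)."""
    return width * height * bytes_per_pixel


MIB = 1024 * 1024
WIDTH, HEIGHT = 1120, 832   # assumed full-screen window size

for label, bpp in (("16-bit", 2), ("24-bit (stored in 32 bits)", 4)):
    size = backing_store_bytes(WIDTH, HEIGHT, bpp)
    print(f"{label}: {size} bytes, about {size / MIB:.2f} MiB per window")
```

Under these assumptions one full-screen window costs 1,863,680 bytes in 16-bit
color and 3,727,360 bytes in 24-bit color, so a handful of large windows adds
up to many megabytes before any application data is counted.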

I personally think the Mach swapfile solution is quite good.  I'm obviously
biased though.  Sure, there are a few things I think could be improved, but
that's true of any piece of software.  Overall I think we've made some
reasonable trade-offs.  I also think swapfile management is fairly bug-free.
We know we can improve the situation in (3) above (but it is difficult).
Certainly if anyone has any other possible reasons for swapfile growth,
especially with concrete examples of programs, let us know so we can
investigate!

I'd be more than happy to read suggestions others have on improving how
swapfiles work.  I can't guarantee we'll implement them, but you never know!

I hope this sheds a little light on the whole swapfile discussion.  Somehow I
think it will still continue on --- but hopefully it can be grounded with a few
more facts now.

         Avie